Outlier Data

Re-experiment Smart: a Novel Method to Enhance Data-driven Prediction of Mechanical Properties of Epoxy Polymers

Cui, Wanshan, Jeong, Yejin, Song, Inwook, Kim, Gyuri, Kwon, Minsang, Lee, Donghun

arXiv.org Artificial Intelligence

Accurate prediction of polymer material properties through data-driven approaches greatly accelerates novel material development by reducing redundant experiments and trial-and-error processes. However, the reliability of such predictions is limited by the quality of the experimental dataset. To address this limitation, we propose a novel approach to enhance dataset quality efficiently by integrating multi-algorithm outlier detection with selective re-experimentation of unreliable outlier cases. To demonstrate its general applicability, we report performance improvements across multiple machine learning models, including Elastic Net, SVR, Random Forest, and TPOT, in predicting three key mechanical properties. Our method reliably reduces prediction error (RMSE) and significantly improves accuracy with minimal additional experimental work, requiring only about 5% of the dataset to be re-measured. These findings highlight the importance of data quality enhancement in achieving reliable machine learning applications in polymer science and present a scalable strategy for improving predictive reliability in materials science.

Introduction. Epoxy adhesives are extensively utilized in a wide range of industries, including automotive, aerospace, and civil engineering, due to their robust adhesion to various substrates, exceptional mechanical properties, and high resistance to heat, corrosion, and chemicals [1-4]. Primarily composed of epoxy resin and a hardener (curing agent), epoxy adhesives may incorporate additional additives, such as accelerators and fillers, for modification [5]. Epoxy adhesives are formulated by subjecting their compositions to a curing process, which can occur at room temperature, at elevated temperature, or through alternative methods such as exposure to UV light [6].
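The abstract does not specify which detectors are combined, but the core idea of multi-algorithm consensus can be sketched with two simple stdlib detectors (z-score and Tukey IQR fences); a measurement is flagged for re-experimentation only when multiple detectors agree. The thresholds and the toy tensile-strength values below are illustrative assumptions, not the paper's settings.

```python
import statistics

def zscore_flags(values, thresh=2.0):
    """Flag points more than `thresh` standard deviations from the mean."""
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [abs(v - mu) / sd > thresh for v in values]

def iqr_flags(values, k=1.5):
    """Flag points outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v < lo or v > hi for v in values]

def consensus_outliers(values, min_votes=2):
    """A point is an outlier only if flagged by at least `min_votes` detectors."""
    detectors = [zscore_flags(values), iqr_flags(values)]
    votes = [sum(flags[i] for flags in detectors) for i in range(len(values))]
    return [i for i, v in enumerate(votes) if v >= min_votes]

strengths = [52.1, 50.8, 51.5, 49.9, 50.3, 51.0, 78.4, 50.6]  # toy tensile strengths (MPa)
print(consensus_outliers(strengths))  # -> [6]: only the 78.4 MPa measurement
```

Requiring agreement between detectors keeps the flagged set small, which matches the paper's point that only about 5% of the dataset needs re-measurement.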


Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data

Kim, Jang-Hyun, Yun, Sangdoo, Song, Hyun Oh

arXiv.org Artificial Intelligence

Diagnosing and cleaning data is a crucial step for building robust machine learning systems. However, identifying problems within large-scale datasets with real-world distributions is challenging due to the presence of complex issues such as label errors, under-representation, and outliers. In this paper, we propose a unified approach for identifying problematic data by utilizing a largely ignored source of information: the relational structure of data in the feature-embedded space. To this end, we present scalable and effective algorithms for detecting label errors and outlier data based on the relational graph structure of the data. We further introduce a visualization tool that provides contextual information about a data point in the feature-embedded space, serving as an effective tool for interactively diagnosing data. We evaluate the label error and outlier/out-of-distribution (OOD) detection performance of our approach on large-scale image, speech, and language tasks, including ImageNet, ESC-50, and SST2. Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in debugging large-scale real-world datasets across various domains. We release code at https://github.com/snu-mllab/Neural-Relation-Graph.
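The paper's actual algorithm builds a relational graph over feature embeddings; as a minimal sketch of the underlying intuition (not the authors' method), a point whose nearest neighbours in embedding space overwhelmingly carry a different label is a label-error candidate. The embeddings, labels, and `k` below are invented toy values.

```python
import math

def knn_label_disagreement(embeddings, labels, k=3):
    """Score each point by the fraction of its k nearest neighbours (in
    feature space) whose label disagrees with its own; a high score
    suggests a possible label error."""
    scores = []
    for i, (e, y) in enumerate(zip(embeddings, labels)):
        dists = sorted(
            (math.dist(e, embeddings[j]), labels[j])
            for j in range(len(embeddings)) if j != i
        )
        neighbours = [lbl for _, lbl in dists[:k]]
        scores.append(sum(lbl != y for lbl in neighbours) / k)
    return scores

# Two tight clusters; the last point sits in cluster 0 but is labelled 1.
emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (0.05, 0.05)]
lab = [0, 0, 0, 1, 1, 1]
scores = knn_label_disagreement(emb, lab)
print(scores)  # the final (mislabelled) point gets the highest score, 1.0
```

Ranking by this score and reviewing the top fraction is the standard workflow for label-noise triage; the paper's relation-graph formulation refines this with similarity kernels and scalability tricks.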


Three Factors to Improve Out-of-Distribution Detection

Choi, Hyunjun, Chung, JaeHo, Jeong, Hawook, Choi, Jin Young

arXiv.org Artificial Intelligence

In the problem of out-of-distribution (OOD) detection, the usage of auxiliary data as outlier data for fine-tuning has demonstrated encouraging performance. However, previous methods have suffered from a trade-off between classification accuracy (ACC) and OOD detection performance (AUROC, FPR, AUPR). To improve this trade-off, we make three contributions: (i) Incorporating a self-knowledge distillation loss can enhance the accuracy of the network; (ii) Sampling semi-hard outlier data for training can improve OOD detection performance with minimal impact on accuracy; (iii) The introduction of our novel supervised contrastive learning can simultaneously improve OOD detection performance and the accuracy of the network. By incorporating all three factors, our approach enhances both accuracy and OOD detection performance by addressing the trade-off between classification and OOD detection. Our method achieves improvements over previous approaches in both performance metrics.
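Contribution (ii), semi-hard outlier sampling, can be illustrated with a small sketch. The paper's criterion is defined in terms of network outputs; here we assume only that each auxiliary outlier has a scalar difficulty score, and keep the middle band: not trivially easy, not so hard that training on it erodes classification accuracy. The percentile cut-offs and scores are illustrative assumptions.

```python
def sample_semi_hard(outlier_scores, lo_pct=0.25, hi_pct=0.75):
    """Keep auxiliary outliers whose difficulty score falls in a middle
    percentile band, discarding the easiest and hardest examples."""
    ranked = sorted(range(len(outlier_scores)), key=lambda i: outlier_scores[i])
    lo = int(len(ranked) * lo_pct)
    hi = int(len(ranked) * hi_pct)
    return ranked[lo:hi]

scores = [0.05, 0.92, 0.40, 0.55, 0.71, 0.10, 0.63, 0.33]
print(sorted(sample_semi_hard(scores)))  # -> [2, 3, 6, 7]
```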


Data-centric Operational Design Domain Characterization for Machine Learning-based Aeronautical Products

Kaakai, Fateh, Adibhatla, Shridhar "Shreeder", Pai, Ganesh, Escorihuela, Emmanuelle

arXiv.org Artificial Intelligence

We give a first rigorous characterization of Operational Design Domains (ODDs) for Machine Learning (ML)-based aeronautical products. Unlike in other application sectors (such as self-driving road vehicles) where ODD development is scenario-based, our approach is data-centric: we propose the dimensions along which the parameters that define an ODD can be explicitly captured, together with a categorization of the data that ML-based applications can encounter in operation, whilst identifying their system-level relevance and impact. Specifically, we discuss how those data categories are useful to determine: the requirements necessary to drive the design of ML Models (MLMs); the potential effects on MLMs and higher levels of the system hierarchy; the learning assurance processes that may be needed, and system architectural considerations. We illustrate the underlying concepts with an example of an aircraft flight envelope.


Mixture Outlier Exposure: Towards Out-of-Distribution Detection in Fine-grained Environments

Zhang, Jingyang, Inkawhich, Nathan, Linderman, Randolph, Chen, Yiran, Li, Hai

arXiv.org Artificial Intelligence

Many real-world scenarios in which DNN-based recognition systems are deployed have inherently fine-grained attributes (e.g., bird-species recognition, medical image classification). In addition to achieving reliable accuracy, a critical subtask for these models is to detect Out-of-distribution (OOD) inputs. Given the nature of the deployment environment, one may expect such OOD inputs to also be fine-grained w.r.t. the known classes (e.g., a novel bird species), which are thus extremely difficult to identify. Unfortunately, OOD detection in fine-grained scenarios remains largely underexplored. In this work, we aim to fill this gap by first carefully constructing four large-scale fine-grained test environments, in which existing methods are shown to have difficulties. In particular, we find that even explicitly incorporating a diverse set of auxiliary outlier data during training does not provide sufficient coverage of the broad region where fine-grained OOD samples lie. We then propose Mixture Outlier Exposure (MixOE), which mixes ID data and training outliers to expand the coverage of different OOD granularities, and trains the model such that the prediction confidence decays linearly as the input transitions from ID to OOD. Extensive experiments and analyses demonstrate the effectiveness of MixOE for building OOD detectors in fine-grained environments. The code is available at https://github.com/zjysteven/MixOE.
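The mixing-with-linear-confidence-decay idea reads naturally as a mixup between an ID sample and a training outlier, with a soft target that interpolates between the ID one-hot label and the uniform distribution. This is a minimal sketch of that construction on plain lists; the actual MixOE implementation operates on image tensors and includes CutMix-style variants, and the inputs below are invented toy values.

```python
import random

def mixoe_example(x_id, y_id, x_ood, num_classes, lam=None):
    """Mix one ID sample with one training outlier (mixup-style) and build
    a soft target that decays linearly from the ID one-hot toward the
    uniform distribution as the input moves from ID to OOD."""
    if lam is None:
        lam = random.random()  # mixing coefficient in [0, 1)
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_id, x_ood)]
    uniform = 1.0 / num_classes
    target = [lam * (1.0 if c == y_id else 0.0) + (1 - lam) * uniform
              for c in range(num_classes)]
    return x_mix, target

x_mix, target = mixoe_example([1.0, 2.0], 0, [3.0, -1.0], num_classes=4, lam=0.5)
print(x_mix)    # [2.0, 0.5]
print(target)   # [0.625, 0.125, 0.125, 0.125]
```

At lam = 1 the target is the pure one-hot label; at lam = 0 it is fully uniform, which is exactly the "confidence decays linearly from ID to OOD" behaviour the abstract describes.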


How Should We Detect and Treat the Outliers?

#artificialintelligence

How should we detect outliers? How should we treat them? An outlier is a data point or observation that behaves very differently from the rest of the data. If we are finding the average net worth of a group of people, and Elon Musk happens to be in that group, the entire analysis will be skewed by that single outlier. This is why outliers should be handled properly before building a machine learning model.
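The net-worth example can be made concrete with a few lines of code: one extreme value drags the mean far from every typical group member, while the median is robust to it. The numbers below are invented for illustration.

```python
import statistics

# Net worths in millions of dollars; one billionaire-scale outlier.
net_worth = [0.05, 0.08, 0.12, 0.10, 0.07, 250.0]

print(round(statistics.fmean(net_worth), 2))   # 41.74 -- the single outlier dominates
print(round(statistics.median(net_worth), 2))  # 0.09  -- reflects a typical member
```

This is why robust summary statistics (or removing/capping flagged outliers) come before model building: a mean-based feature here would describe none of the actual group members.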